
8351623: VectorAPI: Add SVE implementation of subword gather load operation #26236


Open
wants to merge 4 commits into master

Conversation


@XiaohongGong XiaohongGong commented Jul 10, 2025

This is a follow-up patch to [1], implementing the subword gather load APIs for the AArch64 SVE platform.

Background

Vector gather load APIs load values from memory addresses calculated by adding a base pointer to integer indices. SVE provides native gather-load instructions for byte/short types that use int vectors for the indices. The vector size for a gather-load instruction is therefore determined by the index vector (i.e. the int elements), so the total index size is 32 * elem_num bits, where elem_num is the number of elements loaded into the vector register.
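
For reference, a subword gather load is expressed through the public Vector API as below. This is only an illustrative sketch (the class, array and index names are placeholders, not code from this patch); ByteVector.fromArray with an index map is the gather-load entry point:

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class ByteGatherExample {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

    // Gather SPECIES.length() bytes from arr at positions arr[indexMap[0]],
    // arr[indexMap[1]], ... and return them as one byte vector.
    static ByteVector gather(byte[] arr, int[] indexMap) {
        return ByteVector.fromArray(SPECIES, arr, 0, indexMap, 0);
    }
}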

Implementation

Challenges

Due to size differences between int indices (32-bit) and byte/short data (8/16-bit), operations must be split across multiple vector registers based on the target SVE vector register size constraints.

For a 512-bit SVE machine, loading a byte vector with different vector species requires different approaches (the element counts below are checked in the snippet after this list):

  • SPECIES_64: Single operation with mask (8 elements, 256-bit)
  • SPECIES_128: Single operation, full register (16 elements, 512-bit)
  • SPECIES_256: Two operations + merge (32 elements, 1024-bit)
  • SPECIES_512/MAX: Four operations + merge (64 elements, 2048-bit)
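
As a quick check of the element counts above, the lane count of each byte species is available from the public API (a small illustrative snippet, not part of this patch); the index vector then needs length() * 32 bits in total, since each index is an int:

import jdk.incubator.vector.ByteVector;

public class SpeciesLanes {
    public static void main(String[] args) {
        // Number of byte lanes per species.
        System.out.println(ByteVector.SPECIES_64.length());   // 8
        System.out.println(ByteVector.SPECIES_128.length());  // 16
        System.out.println(ByteVector.SPECIES_256.length());  // 32
        System.out.println(ByteVector.SPECIES_512.length());  // 64
    }
}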

Use ByteVector.SPECIES_512 as an example:

  • It contains 64 elements, so the index vector size should be 64 * 32 bits, which is 4 times the SVE vector register size.
  • It requires 4 vector gather-loads to finish the whole operation.
byte[] arr = [a, a, a, a, ..., a, b, b, b, b, ..., b, c, c, c, c, ..., c, d, d, d, d, ..., d, ...]
int[] idx = [0, 1, 2, 3, ..., 63, ...]

4 gather-load:
idx_v1 = [15 14 13 ... 1 0]    gather_v1 = [... 0000 0000 0000 0000 aaaa aaaa aaaa aaaa]
idx_v2 = [31 30 29 ... 17 16]  gather_v2 = [... 0000 0000 0000 0000 bbbb bbbb bbbb bbbb]
idx_v3 = [47 46 45 ... 33 32]  gather_v3 = [... 0000 0000 0000 0000 cccc cccc cccc cccc]
idx_v4 = [63 62 61 ... 49 48]  gather_v4 = [... 0000 0000 0000 0000 dddd dddd dddd dddd]
merge: v = [dddd dddd dddd dddd cccc cccc cccc cccc bbbb bbbb bbbb bbbb aaaa aaaa aaaa aaaa]

Solution

The implementation keeps backend complexity low by defining each gather-load IR node to handle a single vector gather-load operation, with multiple IR nodes generated in the compiler mid-end.

Here are the main changes:

  • Enhanced IR generation with architecture-specific patterns based on the gather_scatter_needs_vector_index() matcher query.
  • Added VectorSliceNode for result merging.
  • Added VectorMaskWidenNode for mask splitting and type conversion for masked gather-loads (see the example after this list).
  • Implemented SVE match rules for subword gather operations.
  • Added comprehensive IR tests for verification.
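
For context, the masked gather-load mentioned above corresponds to the masked overload of the same API. A minimal illustrative sketch (class and variable names are placeholders, not code from this patch); lanes whose mask bit is unset are loaded as zero:

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class MaskedByteGatherExample {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

    // Gather only the lanes whose mask bit is set; unset lanes become zero.
    static ByteVector maskedGather(byte[] arr, int[] indexMap, boolean[] maskBits) {
        VectorMask<Byte> mask = VectorMask.fromArray(SPECIES, maskBits, 0);
        return ByteVector.fromArray(SPECIES, arr, 0, indexMap, 0, mask);
    }
}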

Testing:

  • Passed hotspot::tier1/2/3, jdk::tier1/2/3 tests
  • No regressions found

Performance:

The corresponding JMH benchmarks improve by 3-11x on an NVIDIA Grace CPU, which is a 128-bit SVE2 architecture. Following is the performance data:

Benchmark                                                 SIZE Mode   Cnt Unit   Before      After   Gain
GatherOperationsBenchmark.microByteGather128              64   thrpt  30  ops/ms 13500.891 46721.307 3.46
GatherOperationsBenchmark.microByteGather128              256  thrpt  30  ops/ms  3378.186 12321.847 3.64
GatherOperationsBenchmark.microByteGather128              1024 thrpt  30  ops/ms   844.871  3144.217 3.72
GatherOperationsBenchmark.microByteGather128              4096 thrpt  30  ops/ms   211.386   783.337 3.70
GatherOperationsBenchmark.microByteGather128_MASK         64   thrpt  30  ops/ms 10605.664 46124.957 4.34
GatherOperationsBenchmark.microByteGather128_MASK         256  thrpt  30  ops/ms  2668.531 12292.350 4.60
GatherOperationsBenchmark.microByteGather128_MASK         1024 thrpt  30  ops/ms   676.218  3074.224 4.54
GatherOperationsBenchmark.microByteGather128_MASK         4096 thrpt  30  ops/ms   169.402   817.227 4.82
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  64   thrpt  30  ops/ms 10615.723 46122.380 4.34
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  256  thrpt  30  ops/ms  2671.931 12222.473 4.57
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  1024 thrpt  30  ops/ms   678.437  3091.970 4.55
GatherOperationsBenchmark.microByteGather128_MASK_NZ_OFF  4096 thrpt  30  ops/ms   170.310   813.967 4.77
GatherOperationsBenchmark.microByteGather128_NZ_OFF       64   thrpt  30  ops/ms 13524.671 47223.082 3.49
GatherOperationsBenchmark.microByteGather128_NZ_OFF       256  thrpt  30  ops/ms  3411.813 12343.308 3.61
GatherOperationsBenchmark.microByteGather128_NZ_OFF       1024 thrpt  30  ops/ms   847.919  3129.065 3.69
GatherOperationsBenchmark.microByteGather128_NZ_OFF       4096 thrpt  30  ops/ms   212.790   787.953 3.70
GatherOperationsBenchmark.microByteGather64               64   thrpt  30  ops/ms  8717.294 48176.937 5.52
GatherOperationsBenchmark.microByteGather64               256  thrpt  30  ops/ms  2184.345 12347.113 5.65
GatherOperationsBenchmark.microByteGather64               1024 thrpt  30  ops/ms   546.093  3070.851 5.62
GatherOperationsBenchmark.microByteGather64               4096 thrpt  30  ops/ms   136.724   767.656 5.61
GatherOperationsBenchmark.microByteGather64_MASK          64   thrpt  30  ops/ms  6576.504 48588.806 7.38
GatherOperationsBenchmark.microByteGather64_MASK          256  thrpt  30  ops/ms  1653.073 12341.291 7.46
GatherOperationsBenchmark.microByteGather64_MASK          1024 thrpt  30  ops/ms   416.590  3070.680 7.37
GatherOperationsBenchmark.microByteGather64_MASK          4096 thrpt  30  ops/ms   105.743   767.790 7.26
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   64   thrpt  30  ops/ms  6628.974 48628.463 7.33
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   256  thrpt  30  ops/ms  1676.767 12338.116 7.35
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   1024 thrpt  30  ops/ms   422.612  3070.987 7.26
GatherOperationsBenchmark.microByteGather64_MASK_NZ_OFF   4096 thrpt  30  ops/ms   105.033   767.563 7.30
GatherOperationsBenchmark.microByteGather64_NZ_OFF        64   thrpt  30  ops/ms  8754.635 48525.395 5.54
GatherOperationsBenchmark.microByteGather64_NZ_OFF        256  thrpt  30  ops/ms  2182.044 12338.096 5.65
GatherOperationsBenchmark.microByteGather64_NZ_OFF        1024 thrpt  30  ops/ms   547.353  3071.666 5.61
GatherOperationsBenchmark.microByteGather64_NZ_OFF        4096 thrpt  30  ops/ms   137.853   767.745 5.56
GatherOperationsBenchmark.microShortGather128             64   thrpt  30  ops/ms  8713.480 37696.121 4.32
GatherOperationsBenchmark.microShortGather128             256  thrpt  30  ops/ms  2189.636  9479.710 4.32
GatherOperationsBenchmark.microShortGather128             1024 thrpt  30  ops/ms   545.435  2378.492 4.36
GatherOperationsBenchmark.microShortGather128             4096 thrpt  30  ops/ms   136.213   595.504 4.37
GatherOperationsBenchmark.microShortGather128_MASK        64   thrpt  30  ops/ms  6665.844 37765.315 5.66
GatherOperationsBenchmark.microShortGather128_MASK        256  thrpt  30  ops/ms  1673.950  9482.207 5.66
GatherOperationsBenchmark.microShortGather128_MASK        1024 thrpt  30  ops/ms   420.628  2378.813 5.65
GatherOperationsBenchmark.microShortGather128_MASK        4096 thrpt  30  ops/ms   105.128   595.412 5.66
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 64   thrpt  30  ops/ms  6699.594 37698.398 5.62
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 256  thrpt  30  ops/ms  1682.128  9480.355 5.63
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 1024 thrpt  30  ops/ms   421.942  2380.449 5.64
GatherOperationsBenchmark.microShortGather128_MASK_NZ_OFF 4096 thrpt  30  ops/ms   106.587   595.560 5.58
GatherOperationsBenchmark.microShortGather128_NZ_OFF      64   thrpt  30  ops/ms  8788.830 37709.493 4.29
GatherOperationsBenchmark.microShortGather128_NZ_OFF      256  thrpt  30  ops/ms  2199.706  9485.769 4.31
GatherOperationsBenchmark.microShortGather128_NZ_OFF      1024 thrpt  30  ops/ms   548.309  2380.494 4.34
GatherOperationsBenchmark.microShortGather128_NZ_OFF      4096 thrpt  30  ops/ms   137.434   595.448 4.33
GatherOperationsBenchmark.microShortGather64              64   thrpt  30  ops/ms  5296.860 37797.813 7.13
GatherOperationsBenchmark.microShortGather64              256  thrpt  30  ops/ms  1321.738  9602.510 7.26
GatherOperationsBenchmark.microShortGather64              1024 thrpt  30  ops/ms   330.520  2404.013 7.27
GatherOperationsBenchmark.microShortGather64              4096 thrpt  30  ops/ms    82.149   602.956 7.33
GatherOperationsBenchmark.microShortGather64_MASK         64   thrpt  30  ops/ms  3458.968 37851.452 10.94
GatherOperationsBenchmark.microShortGather64_MASK         256  thrpt  30  ops/ms   879.143  9616.554 10.93
GatherOperationsBenchmark.microShortGather64_MASK         1024 thrpt  30  ops/ms   220.256  2408.851 10.93
GatherOperationsBenchmark.microShortGather64_MASK         4096 thrpt  30  ops/ms    54.947   603.251 10.97
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  64   thrpt  30  ops/ms  3521.856 37736.119 10.71
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  256  thrpt  30  ops/ms   881.456  9602.649 10.89
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  1024 thrpt  30  ops/ms   220.122  2409.030 10.94
GatherOperationsBenchmark.microShortGather64_MASK_NZ_OFF  4096 thrpt  30  ops/ms    55.845   603.126 10.79
GatherOperationsBenchmark.microShortGather64_NZ_OFF       64   thrpt  30  ops/ms  5279.815 37698.023 7.14
GatherOperationsBenchmark.microShortGather64_NZ_OFF       256  thrpt  30  ops/ms  1307.935  9601.551 7.34
GatherOperationsBenchmark.microShortGather64_NZ_OFF       1024 thrpt  30  ops/ms   329.707  2409.962 7.30
GatherOperationsBenchmark.microShortGather64_NZ_OFF       4096 thrpt  30  ops/ms    82.092   603.380 7.35

[1] https://bugs.openjdk.org/browse/JDK-8355563
[2] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1B--scalar-plus-vector-Gather-load-unsigned-bytes-to-vector--vector-index--?lang=en
[3] https://developer.arm.com/documentation/ddi0602/2024-12/SVE-Instructions/LD1H--scalar-plus-vector---Gather-load-unsigned-halfwords-to-vector--vector-index--?lang=en


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8351623: VectorAPI: Add SVE implementation of subword gather load operation (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26236/head:pull/26236
$ git checkout pull/26236

Update a local copy of the PR:
$ git checkout pull/26236
$ git pull https://git.openjdk.org/jdk.git pull/26236/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26236

View PR using the GUI difftool:
$ git pr show -t 26236

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26236.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jul 10, 2025

👋 Welcome back xgong! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jul 10, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 10, 2025
@openjdk

openjdk bot commented Jul 10, 2025

@XiaohongGong The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Jul 10, 2025
@mlbridge

mlbridge bot commented Jul 10, 2025

Webrevs

@XiaohongGong
Author

Hi @Bhavana-Kilambi, @fg1417, could you please help take a look at this PR? BTW, since the vector register size of my SVE machine is 128-bit, could you please help test the correctness on an SVE machine with a larger vector size (e.g. 512-bit)? Thanks a lot in advance!

@Bhavana-Kilambi
Contributor

Hi @XiaohongGong, thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512-bit machines). We will get back to you with the results soon.

@XiaohongGong
Author

> Hi @XiaohongGong, thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512-bit machines). We will get back to you with the results soon.

Testing on 256-bit SVE machines is fine with me. Thanks so much for your help!

ins_pipe(pipe_slow);
%}

instruct vmaskwiden_hi_sve(pReg dst, pReg src) %{
Contributor

Can both the hi and lo widen rules be combined into a single one, as the arguments are the same? Or would that make it less understandable?

Author

The main problem is that we cannot easily get the __is_lo flag from the corresponding machnode, as far as I know.

Contributor

Agreed. I remember I had the same problem with requires_strict_order field in ReductionNodes. Thanks.

@@ -348,6 +347,12 @@ source %{
return false;
}

// SVE requires vector indices for gather-load/scatter-store operations
// on all data types.
bool Matcher::gather_scatter_needs_vector_index(BasicType bt) {
Contributor

@Bhavana-Kilambi Bhavana-Kilambi Jul 14, 2025

There's already a function that tests for UseSVE > 0 here -

static bool supports_scalable_vector() {

Can it be reused?

Author

Do you mean directly using supports_scalable_vector instead of the newly added method in the mid-end? I'm afraid we cannot: on x86, the indices for subword types are passed as the address of the index array, while for other types they are passed as an index vector, even on AVX-512.

But yes, we can call supports_scalable_vector() inside the newly added method for AArch64.

Contributor

Got it, thanks! I missed the point that this was added in the mid-end.

@fg1417

fg1417 commented Jul 15, 2025

@XiaohongGong thanks for your work! Tier1 to tier3 passed on a 256-bit SVE machine without new failures.

@fg1417

fg1417 commented Jul 15, 2025

@XiaohongGong Please correct me if I’m missing something or got anything wrong.

Taking short on a 512-bit machine as an example, these instructions would be generated:

// vgather
sve_dup vtmp, 0
sve_load_0 =>  [0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a]
sve_uzp1 with vtmp =>  [00 00 00 00 00 00 00 00 aa aa aa aa aa aa aa aa]

// vgather1
sve_dup vtmp, 0
sve_load_1 =>  [0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b]
sve_uzp1 with vtmp =>  [00 00 00 00 00 00 00 00 bb bb bb bb bb bb bb bb]

// Slice vgather1, vgather1
ext =>  [bb bb bb bb bb bb bb bb 00 00 00 00 00 00 00 00]

// Or vgather, vslice
sve_orr =>  [bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa]

Actually, we can get the target result directly by applying uzp1 to the outputs of sve_load_0 and sve_load_1, like:

[0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a]
[0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b]
uzp1 => 
[bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa]

If so, the current design of LoadVectorGather may not be sufficiently low-level to suit AArch64. WDYT?

@XiaohongGong
Author

> @XiaohongGong thanks for your work! Tier1 to tier3 passed on a 256-bit SVE machine without new failures.

Good! Thanks so much for your help!

@XiaohongGong
Author

XiaohongGong commented Jul 16, 2025


Yes, you are right! This works for truncating and merging two gather-load results. But we have to consider the other scenarios together: 1) no merging is needed; 2) four gather-loads plus merging are needed. Additionally, we have to keep the semantics of LoadVectorGatherNode common across all scenarios and different architectures.

To keep the IR itself simple and to unify the inputs for all types across architectures, I chose to pass one index vector to it for now, and defined that one LoadVectorGatherNode finishes a single gather-load with that index. The element type of the result should be the subword type, so a subsequent type truncation is needed anyway. I think this makes sense for a single gather-load operation on subword types, right?

For cases that need more than one gather, I chose to generate multiple LoadVectorGatherNodes and do the merging at the end. I agree this may make the code less efficient than implementing all scenarios with one LoadVectorGatherNode; writing backend assembly for all scenarios could be more efficient, but it makes the backend implementation more complex. In addition to the four normal gather cases, we would have to consider the corresponding masked versions and partial cases. Besides, the number of index vectors passed to LoadVectorGatherNode would differ (e.g. 1, 2 or 4), which makes the IR itself harder to maintain.

Regarding the refinement based on your suggestion:

  • case-1: no merging
    • It's not an issue (the current version is fine)
  • case-2: two gathers and a merge
    • Can be refined, but LoadVectorGatherNode should be changed to accept 2 index vectors.
  • case-3: four gathers and merges (byte only)
    • Can be refined. We can implement it like:
      step-1: v1 = gather1 + gather2 + 2 * uzp1 // merge the first and second gather-loads
      step-2: v2 = gather3 + gather4 + 2 * uzp1 // merge the third and fourth gather-loads
      step-3: v3 = slice (v2, v2), v = or(v1, v3) // do the final merging
      We have to change LoadVectorGatherNode as well, at least making it accept 2 index vectors.

In summary, LoadVectorGatherNode will be more complex than before, but the good thing is that giving it one more index input is OK. I'm not sure whether this is applicable to other architectures such as RVV, but I can try this change. Do you have a better idea? Thanks!

@fg1417

fg1417 commented Jul 16, 2025


@XiaohongGong thanks for your reply.

This idea generally looks good to me.

For case-2, we have

gather1 + gather2 + uzp1:
[0a 0a 0a 0a 0a 0a 0a 0a]
[0b 0b 0b 0b 0b 0b 0b 0b]
uzp1.H  => 
[bb bb bb bb aa aa aa aa]

Can we improve case-3 by following the pattern of case-2?

step-1:  v1 = gather1 + gather2 + uzp1 
[000a 000a 000a 000a 000a 000a 000a 000a]
[000b 000b 000b 000b 000b 000b 000b 000b]
uzp1.H => [0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a]

step-2:  v2 = gather3 + gather4 + uzp1 
[000c 000c 000c 000c 000c 000c 000c 000c]
[000d 000d 000d 000d 000d 000d 000d 000d]
uzp1.H => [0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c]

step-3:  v3 = uzp1 (v1, v2)
[0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a]
[0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c]
uzp1.B => [dddd dddd cccc cccc bbbb bbbb aaaa aaaa]

Then we can also consistently define the semantics of LoadVectorGatherNode as gather1 + gather2 + uzp1.H, which would make the backend much cleaner. WDYT?

@XiaohongGong
Author

XiaohongGong commented Jul 17, 2025


Thanks! Regarding the definition of LoadVectorGatherNode, we'd better keep the vector type as-is for byte and short vectors: the SVE gather-load instruction needs the type information, and the vector layout of the result should match the vector type, right? We can handle this easily in a pure backend implementation, but it does not seem easy at the mid-end IR level. BTW, uzp1 is an SVE-specific instruction; we'd better define a common IR node for it, which would also be useful for other platforms that want to support the subword gather API, right? I'm not sure whether this makes sense. I will give this suggestion some consideration.

@XiaohongGong
Author

XiaohongGong commented Jul 17, 2025


Maybe I can define the vector type of LoadVectorGatherNode as an int vector type for subword types, with an additional flag to denote whether it is a byte or short load. It would only perform the gather operation (without any truncation). Then define an IR node like VectorConcateNode to merge the gather results. For cases where only one gather is needed, we can just emit a type cast node like VectorCastI2X. This seems to make the IR more general and the code cleaner.

The implementation would look like:

  • case-1 one gather:
    • gather (bt: int) + cast (bt: byte|short)
  • case-2 two gathers:
    • step-1: gather1 (bt: int) + gather2 (bt: int) + concate(gather1, gather2) (bt: short)
    • step-2: cast (bt: byte) // just for byte vectors
  • case-3 four gathers:
    • step-1: gather1 (bt: int) + gather2 (bt: int) + concate(gather1, gather2) (bt: short)
    • step-2: gather3 (bt: int) + gather4 (bt: int) + concate(gather3, gather4) (bt: short)
    • step-3: concate (bt: byte)

Or, more generally:

  • case-1 one gather:
    • gather (bt: int) + cast (bt: byte|short)
  • case-2 two gathers:
    • step-1: gather1 (bt: int) + gather2 (bt: int) + concate(gather1, gather2) (bt: byte|short)
  • case-3 four gathers:
    • step-1: gather1 (bt: int) + gather2 (bt: int) + gather3 (bt: int) + gather4 (bt: int)
    • step-2: concate(gather1, gather2, gather3, gather4) (bt: byte|short)

At the IR level, which one do you think is better?

@fg1417

fg1417 commented Jul 17, 2025


That makes sense to me. Thanks for your explanation!


I like this idea! The first one looks better, in which concate would provide lower-level and more fine-grained semantics, allowing us to define fewer IR node types while supporting more scenarios.

@XiaohongGong
Author

> I like this idea! The first one looks better, in which concate would provide lower-level and more fine-grained semantics, allowing us to define fewer IR node types while supporting more scenarios.

Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion!

@fg1417

fg1417 commented Jul 17, 2025

> Yes, I agree with you. I'm now working on refactoring the IR based on the first idea. I will update the patch as soon as possible. Thanks for your valuable suggestion!

Thanks! I’d suggest also highlighting aarch64 in the JBS title, so others who are interested won’t miss it.

@XiaohongGong
Author

> Thanks! I'd suggest also highlighting aarch64 in the JBS title, so others who are interested won't miss it.

Thanks for your point!
I'm not sure, since this is not a pure AArch64 backend patch as far as I can see. Actually, the backend rules are quite simple, and the mid-end IR change is relatively more complex. I'm also not sure whether this patch would then be missed by others who are not familiar with AArch64 if it were highlighted.

@XiaohongGong
Author

Hi @fg1417, the latest commit refactors the IR patterns and the LoadVectorGather[Masked] IR based on the above discussion. Could you please take another look? Thanks!

Main changes

  • The type of LoadVectorGather[Masked] is changed from the original subword vector type to an int vector type. Additionally, a _mem_bt member is added to denote the load type.
    • backend rules are cleaner
    • mask generation for partial cases is cleaner
  • Defined VectorConcatenateNode and removed VectorSliceNode.
    • VectorConcatenateNode has the same function as SVE/NEON's uzp1: it narrows the element size of each input to half and concatenates the narrowed results from src1 and src2 into dst (src1 in the lower part and src2 in the higher part of dst).
  • The matcher helper function vector_idea_reg_size() is no longer needed and has been removed. Originally it was used by VectorSlice.
  • More IR tests are added for the different vector species.

IR implementation

  • One gather-load needed:
    • LoadVectorGather (bt: int) + VectorCastI2X (bt: byte|short)
  • Two gather-loads and a merge needed:
    • step-1: v1 = LoadVectorGather (bt: int), v2 = LoadVectorGather (bt: int)
    • step-2: merge = VectorConcatenate(v1, v2) (bt: short)
    • step-3: (byte only) v = VectorCastS2X(merge) (bt: byte)
  • Four gather-loads and merges needed (byte vectors only):
    • step-1: v1 = LoadVectorGather (bt: int), v2 = LoadVectorGather (bt: int)
    • step-2: merge1 = VectorConcatenate(v1, v2) (bt: short)
    • step-3: v3 = LoadVectorGather (bt: int), v4 = LoadVectorGather (bt: int)
    • step-4: merge2 = VectorConcatenate(v3, v4) (bt: short)
    • step-5: v = VectorConcatenate(merge1, merge2) (bt: byte)
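
All of these IR shapes are reached from the same public API calls; which case is taken depends on the species and the hardware vector length. A small illustrative sketch (class, array and species choices are placeholders, not code from this patch):

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.ShortVector;

public class SubwordGatherExamples {
    // Byte gather: may expand to one, two, or four LoadVectorGather nodes
    // depending on the species size relative to the SVE vector length.
    static ByteVector byteGather(byte[] a, int[] idx) {
        return ByteVector.fromArray(ByteVector.SPECIES_256, a, 0, idx, 0);
    }

    // Short gather: needs at most two gathers plus one VectorConcatenate,
    // since the index vector is only twice as large as the data vector.
    static ShortVector shortGather(short[] a, int[] idx) {
        return ShortVector.fromArray(ShortVector.SPECIES_256, a, 0, idx, 0);
    }
}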

Performance change

About 4% ~ 9% uplift is observed on some micro benchmarks, with no significant regressions.
Following is the performance change on NVIDIA Grace with the latest commit:

Benchmark                        (SIZE)   Mode   Units      Before     After   Gain
microByteGather128                   64  thrpt   ops/ms  48405.283  48668.502  1.005
microByteGather128                  256  thrpt   ops/ms  12821.924  12662.342  0.987
microByteGather128                 1024  thrpt   ops/ms   3253.778   3198.608  0.983
microByteGather128                 4096  thrpt   ops/ms    817.604    801.250  0.979
microByteGather128_MASK              64  thrpt   ops/ms  46124.722  48334.916  1.047
microByteGather128_MASK             256  thrpt   ops/ms  12152.575  12652.821  1.041
microByteGather128_MASK            1024  thrpt   ops/ms   3075.066   3193.787  1.038
microByteGather128_MASK            4096  thrpt   ops/ms    812.738    803.017  0.988
microByteGather128_MASK_NZ_OFF       64  thrpt   ops/ms  46130.244  48384.633  1.048
microByteGather128_MASK_NZ_OFF      256  thrpt   ops/ms  12139.800  12624.298  1.039
microByteGather128_MASK_NZ_OFF     1024  thrpt   ops/ms   3078.040   3203.049  1.040
microByteGather128_MASK_NZ_OFF     4096  thrpt   ops/ms    812.716    802.712  0.987
microByteGather128_NZ_OFF            64  thrpt   ops/ms  48369.524  48643.937  1.005
microByteGather128_NZ_OFF           256  thrpt   ops/ms  12814.552  12672.757  0.988
microByteGather128_NZ_OFF          1024  thrpt   ops/ms   3253.294   3202.016  0.984
microByteGather128_NZ_OFF          4096  thrpt   ops/ms    818.389    805.488  0.984
microByteGather64                    64  thrpt   ops/ms  48491.633  50615.848  1.043
microByteGather64                   256  thrpt   ops/ms  12340.778  13156.762  1.066
microByteGather64                  1024  thrpt   ops/ms   3067.592   3322.777  1.083
microByteGather64                  4096  thrpt   ops/ms    767.111    832.409  1.085
microByteGather64_MASK               64  thrpt   ops/ms  48526.894  50730.468  1.045
microByteGather64_MASK              256  thrpt   ops/ms  12340.398  13159.723  1.066
microByteGather64_MASK             1024  thrpt   ops/ms   3066.227   3327.964  1.085
microByteGather64_MASK             4096  thrpt   ops/ms    767.390    833.327  1.085
microByteGather64_MASK_NZ_OFF        64  thrpt   ops/ms  48472.912  51287.634  1.058
microByteGather64_MASK_NZ_OFF       256  thrpt   ops/ms  12331.578  13258.954  1.075
microByteGather64_MASK_NZ_OFF      1024  thrpt   ops/ms   3070.319   3345.911  1.089
microByteGather64_MASK_NZ_OFF      4096  thrpt   ops/ms    767.097    838.008  1.092
microByteGather64_NZ_OFF             64  thrpt   ops/ms  48492.984  51224.743  1.056
microByteGather64_NZ_OFF            256  thrpt   ops/ms  12334.944  13240.494  1.073
microByteGather64_NZ_OFF           1024  thrpt   ops/ms   3067.754   3343.387  1.089
microByteGather64_NZ_OFF           4096  thrpt   ops/ms    767.123    837.642  1.091
microShortGather128                  64  thrpt   ops/ms  37717.835  37041.162  0.982
microShortGather128                 256  thrpt   ops/ms   9467.160   9890.109  1.044
microShortGather128                1024  thrpt   ops/ms   2376.520   2481.753  1.044
microShortGather128                4096  thrpt   ops/ms    595.030    621.274  1.044
microShortGather128_MASK             64  thrpt   ops/ms  37655.017  37036.887  0.983
microShortGather128_MASK            256  thrpt   ops/ms   9471.324   9859.461  1.040
microShortGather128_MASK           1024  thrpt   ops/ms   2376.811   2477.106  1.042
microShortGather128_MASK           4096  thrpt   ops/ms    595.049    620.082  1.042
microShortGather128_MASK_NZ_OFF      64  thrpt   ops/ms  37636.229  37029.468  0.983
microShortGather128_MASK_NZ_OFF     256  thrpt   ops/ms   9483.674   9867.427  1.040
microShortGather128_MASK_NZ_OFF    1024  thrpt   ops/ms   2379.877   2478.608  1.041
microShortGather128_MASK_NZ_OFF    4096  thrpt   ops/ms    594.710    620.455  1.043
microShortGather128_NZ_OFF           64  thrpt   ops/ms  37706.896  37044.505  0.982
microShortGather128_NZ_OFF          256  thrpt   ops/ms   9487.006   9882.079  1.041
microShortGather128_NZ_OFF         1024  thrpt   ops/ms   2379.571   2482.341  1.043
microShortGather128_NZ_OFF         4096  thrpt   ops/ms    595.099    621.392  1.044
microShortGather64                   64  thrpt   ops/ms  37773.485  37502.698  0.992
microShortGather64                  256  thrpt   ops/ms   9591.046   9640.225  1.005
microShortGather64                 1024  thrpt   ops/ms   2406.013   2420.376  1.005
microShortGather64                 4096  thrpt   ops/ms    603.270    606.541  1.005
microShortGather64_MASK              64  thrpt   ops/ms  37781.860  37479.295  0.991
microShortGather64_MASK             256  thrpt   ops/ms   9608.015   9657.010  1.005
microShortGather64_MASK            1024  thrpt   ops/ms   2406.828   2422.170  1.006
microShortGather64_MASK            4096  thrpt   ops/ms    602.965    606.283  1.005
microShortGather64_MASK_NZ_OFF       64  thrpt   ops/ms  37740.577  37487.740  0.993
microShortGather64_MASK_NZ_OFF      256  thrpt   ops/ms   9593.611   9663.041  1.007
microShortGather64_MASK_NZ_OFF     1024  thrpt   ops/ms   2404.846   2423.493  1.007
microShortGather64_MASK_NZ_OFF     4096  thrpt   ops/ms    602.691    605.911  1.005
microShortGather64_NZ_OFF            64  thrpt   ops/ms  37723.586  37507.899  0.994
microShortGather64_NZ_OFF           256  thrpt   ops/ms   9589.985   9630.033  1.004
microShortGather64_NZ_OFF          1024  thrpt   ops/ms   2405.774   2423.655  1.007
microShortGather64_NZ_OFF          4096  thrpt   ops/ms    602.778    606.151  1.005
